Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR
Authors
Abstract
Due to the lack of explicit word boundaries in Chinese and Japanese, and to some extent in Korean, an additional problem in IR in these languages is to determine the appropriate indexing units. For CLIR with these languages, we also need to determine translation units. Both words and n-grams of characters have been used in IR in these languages; however, only words have been used as translation units in previous studies. In this paper, we compare the use of words and n-grams for both monolingual and cross-lingual IR in these languages. Our experiments show that Chinese character n-grams are reasonable alternative indexing and translation units to words, and they lead to retrieval effectiveness comparable to or higher than words. For Japanese and Korean IR, bigrams or a combination of bigrams and unigrams produce the highest effectiveness.
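As an illustration of the character n-gram indexing units the abstract compares, here is a minimal sketch (not taken from the paper) of extracting overlapping character unigrams and bigrams from unsegmented CJK text; the function name and the sample phrase are our own for demonstration:

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a string.

    Whitespace is removed first, since CJK text carries no
    explicit word boundaries to preserve.
    """
    chars = "".join(text.split())
    if len(chars) < n:
        return [chars] if chars else []
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

# Example: index a short Chinese phrase ("information retrieval")
# as unigrams and as bigrams.
phrase = "信息检索"
print(char_ngrams(phrase, 1))  # ['信', '息', '检', '索']
print(char_ngrams(phrase, 2))  # ['信息', '息检', '检索']
```

Each bigram overlaps its neighbor by one character, so every two-character word in the text is guaranteed to appear as some indexing unit without requiring a word segmenter.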
Similar resources
The University of Amsterdam at NTCIR-5
We describe the University of Amsterdam’s participation in the Cross-Lingual Information Retrieval task at NTCIR-5. We focused on Chinese monolingual retrieval, and aimed to study the effectiveness of language models and different tokenization methods for Chinese. Our main findings are the following. First, where the vector space model excels on a bigram index, the language model performs poorl...
Phrasal Translation for English-Chinese Cross Language Information Retrieval
This paper introduces a simple and effective nonoverlapping unigram and bigram segmentation method for both monolingual Chinese and English-Chinese cross language retrieval. It also describes English-Chinese cross language retrieval experiments involving 54 topics and some 164,000 documents. The translation of English queries to Chinese is done using a Chinese-English dictionary of about 120,00...
Monolingual Experiments with Far-East Languages in NTCIR-6
This paper describes our third participation in an evaluation campaign involving the Chinese, Japanese and Korean languages (NTCIR-6). Our participation is motivated by three objectives: 1) study the retrieval performances of various probabilistic and language models for these languages; 2) compare the relative retrieval effectiveness of a combined “unigram & bigram” indexing scheme combined wi...
Cross Language Information Retrieval for Biomedical Literature
This workshop report discusses the collaborative work of UT, EMC and TNO on the TREC Genomics Track 2007. The biomedical information retrieval task is approached using cross language methods, in which biomedical concept detection is combined with effective IR based on unigram language models. Furthermore, a co-occurrence method is used to select and filter candidate answers. On its own, the cro...
Special Issue on Artificial Intelligence IJACSA Special Issue Guest Editor
The main aim of this study is to develop a part-of-speech tagger for the Afaan Oromo language. After reviewing the literature on Afaan Oromo grammar and identifying a tagset and word categories, the study adopted the Hidden Markov Model (HMM) approach and implemented unigram and bigram models with the Viterbi algorithm. The unigram model is used to understand word ambiguity in the language, while the bigram model is u...
Publication date: 2007